This
lecture is about
sentiment classification.
If we assume that
most of the elements in the opinion
representation are already known,
then our only task may be just a sentiment
classification, as shown in this case.
So suppose we know who's the opinion
holder and what's the opinion target,
and also know the content and the context
of the opinion, then we mainly need to
decide the opinion
sentiment of the review.
So this is a case of just using sentiment
classification for understanding opinion.
Sentiment classification can be
defined more specifically as follows.
The input is an opinionated text object,
the output is typically a sentiment label,
or a sentiment tag, and
that can be designed in two ways.
One is polarity analysis, where we have
categories such as positive, negative,
or neutral.
The other is emotion
analysis that can go beyond
a polarity to characterize
the feeling of the opinion holder.
In the case of polarity analysis,
we sometimes
also have numerical ratings as you
often see in some reviews on the web.
Five might denote the most positive, and
one maybe the most negative, for example.
In general, you have just discrete
categories to characterize the sentiment.
In emotion analysis, of course,
there are also different ways to
design the categories.
The six most frequently
used categories are happy,
sad, fearful, angry,
surprised, and disgusted.
So as you can see, the task is essentially
a classification task, or categorization
task, as we've seen before, so it's
a special case of text categorization.
This also means any textual categorization
method can be used to do sentiment
classification.
Now of course if you just do that,
the accuracy may not be good
because sentiment classification
does require some improvements over
regular text categorization techniques,
or simple text categorization techniques.
In particular,
it needs two kinds of improvements.
One is to use more sophisticated features
that may be more appropriate for
sentiment tagging as I
will discuss in a moment.
The other is to consider
the order of these categories, and
especially in polarity analysis,
it's very clear there's an order here,
and so these categories
are not all that independent.
There's order among them, and so
it's useful to consider the order.
For example, we could use
ordinal regression to do that,
and that's something that
we'll talk more about later.
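To give a rough idea of what "considering the order" can mean, here is a minimal sketch of one common way to encode ordered ratings: a rating on a 1..k scale is turned into k-1 yes/no questions of the form "is the rating at least j?". This is only an illustration under assumptions. The function names `ordinal_targets` and `decode` and the 5-point scale are hypothetical, and this is not necessarily the exact ordinal regression method covered later.

```python
def ordinal_targets(rating, k=5):
    """Encode a rating in 1..k as k-1 binary answers to
    'is rating >= j?' for j = 2..k (assumed 5-point scale by default)."""
    return [1 if rating >= j else 0 for j in range(2, k + 1)]


def decode(binary_answers):
    """Recover the rating by counting how many thresholds were passed."""
    return 1 + sum(binary_answers)


# A 4-star review passes the thresholds 2, 3, and 4, but not 5.
targets_for_four = ordinal_targets(4)
recovered = decode(targets_for_four)
```

Each of the k-1 binary targets could then be predicted by an ordinary binary classifier, so that neighboring ratings share most of their targets instead of being treated as unrelated categories.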
So now, let's talk about some features
that are often very useful for
text categorization and
text mining in general, but
some of them are especially
needed for sentiment analysis.
So let's start from the simplest one,
which is character n-grams.
You can just have a sequence
of characters as a unit,
and they can be mixed with different n's,
different lengths.
This is a very general and
very robust way to
represent the text data.
And you could do that for
any language, pretty much.
And this is also robust to spelling
errors or recognition errors, right?
So if you misspell a word by one character,
this representation would still
allow you to match it against the word when
it occurs in the text spelled correctly.
The misspelled word and
the correct form can be matched because
they contain some common
character n-grams.
But of course such a representation
would not be as discriminating as words.
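As a small illustration of the robustness point, the sketch below extracts character n-grams with mixed lengths and compares a misspelled word with its correct form. The helper names `char_ngrams` and `jaccard`, the choice of n = 2 and 3, and the example word are all assumptions made for this illustration.

```python
def char_ngrams(text, n_values=(2, 3)):
    """Return the set of character n-grams of `text`, mixing several lengths n."""
    grams = set()
    for n in n_values:
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams


def jaccard(a, b):
    """Overlap between two n-gram sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


correct = char_ngrams("excellent")
misspelled = char_ngrams("excelent")   # one character missing

shared = correct & misspelled          # n-grams the two spellings have in common
similarity = jaccard(correct, misspelled)
```

As whole words, "excellent" and "excelent" do not match at all, but most of their character n-grams are shared, so the two forms still look similar in this representation.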
So next, we have word n-grams,
a sequence of words and again,
we can mix them with different n's.
Unigrams are actually often very
effective for a lot of text processing
tasks, and it's mostly because words
are well-designed features by humans for
communication, and so
they are often good enough for many tasks.
But they're not good, or not sufficient, for
sentiment analysis, clearly.
For example, we might see a sentence like,
it's not good or
it's not as good as something else, right?
So in such a case, if you
just take "good",
that would suggest positive, but the sentence
is not positive, so it's not accurate.
But if you take the bigram "not good"
together, then it's more accurate.
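The "not good" example can be sketched in a few lines. This is a minimal illustration, assuming simple whitespace tokenization; the function name `word_ngrams` is hypothetical.

```python
def word_ngrams(text, n):
    """Return the word n-grams of `text`, using naive whitespace tokenization."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "it's not good"
unigrams = word_ngrams(sentence, 1)   # contains "good" on its own
bigrams = word_ngrams(sentence, 2)    # contains "not good" as one unit
```

With unigrams only, the feature "good" is indistinguishable from the "good" of a positive review. With bigrams, "not good" is available as a separate and more specific feature.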
So longer n-grams are generally more
discriminative, and they're more specific.
If you match one, it says a lot, and
it's accurate; it's unlikely to be
very ambiguous.
But it may cause overfitting, because with
such very unique features, a machine
learning program can easily pick up
such features from the training set and
rely on such unique features
to distinguish the categories.
And obviously, that kind of classifier
won't generalize well to future data, where
such discriminative features
will not necessarily occur.
So that's a problem of
overfitting that's not desirable.
We can also consider part of speech tag
n-grams if we can do part of
speech tagging. For example, an
adjective and a noun could form a pair.
We can also mix n-grams of words and
n-grams of part of speech tags.
For example, the word great might be
followed by a noun, and this could become
a feature, a hybrid feature, that could
be useful for sentiment analysis.
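Here is a rough sketch of what such features could look like. It assumes the text has already been tagged; the tags below are written by hand for the example (no real tagger is called), and the tag names and the function names `pos_bigrams` and `hybrid_bigrams` are hypothetical.

```python
# Hand-tagged toy input: (word, part-of-speech tag) pairs.
# In practice these tags would come from a part of speech tagger.
tagged = [
    ("a", "DET"), ("great", "ADJ"), ("phone", "NOUN"),
    ("with", "ADP"), ("great", "ADJ"), ("battery", "NOUN"),
]


def pos_bigrams(tagged_tokens):
    """N-grams over tags only, e.g. an adjective followed by a noun."""
    tags = [tag for _, tag in tagged_tokens]
    return [f"{a}_{b}" for a, b in zip(tags, tags[1:])]


def hybrid_bigrams(tagged_tokens):
    """Mixed features: a word followed by the tag of the next token."""
    return [
        f"{word.lower()}_{next_tag}"
        for (word, _), (_, next_tag) in zip(tagged_tokens, tagged_tokens[1:])
    ]


tag_features = pos_bigrams(tagged)
mixed_features = hybrid_bigrams(tagged)
```

The hybrid feature "great_NOUN" fires for both "great phone" and "great battery", so it is more general than either word bigram but more specific than the word "great" alone.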
So next we can also have word classes.
So these classes can be syntactic, like
part of speech tags, or could be semantic,
and they might represent concepts in
a thesaurus or ontology, like WordNet.
Or they can be recognized named
entities, like people or places, and
these categories can be used to enrich
the representation as additional features.
We can also learn word clusters
empirically. For example,
we've talked about mining
associations of words.
And so we can have clusters of
paradigmatically related words or
syntagmatically related words, and
these clusters can be features to
supplement the word-based representation.
Furthermore, we can also have
frequent patterns in text, and
these could be frequent word sets, where
the words that
form the pattern do not necessarily
occur together or next to each other.
But we can also have collocations, where
the words may occur more closely together,
and such
patterns provide more discriminative
features than words, obviously.
And they may also generalize better
than just regular n-grams because they
are frequent.
So you expect them to
occur also in test data.
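A very small sketch of the frequent word set idea: count pairs of words that appear in the same document, regardless of position, and keep only those that occur in enough documents. The function name `frequent_word_pairs`, the `min_support` threshold, and the toy documents are assumptions for illustration; real pattern mining algorithms handle larger sets and much larger collections.

```python
from collections import Counter
from itertools import combinations


def frequent_word_pairs(docs, min_support=2):
    """Return word pairs that co-occur in at least `min_support` documents.
    The words of a pair need not be adjacent in the document."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))   # each pair counted once per doc
        counts.update(combinations(words, 2))
    return {pair: c for pair, c in counts.items() if c >= min_support}


docs = [
    "battery life is great",
    "great phone but battery died",
    "the screen is great",
]
frequent_pairs = frequent_word_pairs(docs, min_support=2)
```

Because a pair is kept only if it recurs across documents, it has a better chance of also appearing in unseen data than an arbitrary long n-gram that was observed once.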
So they have a lot of advantages, but
they might still face the problem
of overfitting as the features
become more complex.
This is a problem in general, and the same
is true for parse tree-based features,
where you can use a parse tree to derive
features such as frequent subtrees or
paths, and
those are even more discriminating, but
they're also more likely
to cause overfitting.
And in general, pattern discovery
algorithms are very useful for
feature construction because they allow
us to search a large space of possible
features that are more complex than
words and that are sometimes useful.
So in general, natural language
processing is very important for
deriving complex features, and
it can enrich text representation.
So for example,
this is a simple sentence that I showed
you a long time ago in another lecture.
So from these words we can only
derive simple word n-gram
representations or character n-grams.
But with NLP,
we can enrich the representation
with a lot of other information such
as part of speech tags, parse trees or
entities, or even speech acts.
Now with such enriching information,
of course, we can generate a lot
of other features, more complex features
like mixed n-grams of words and
part of speech tags, or
even a part of a parse tree.
So in general, feature design actually
affects categorization accuracy
significantly, and it's a very important
part of any machine learning application.
In general, I think it would be
most effective if you can combine
machine learning, error analysis, and
domain knowledge in designing features.
So first you want to
use domain knowledge,
your understanding of the problem,
to design seed features, and
you can also define a basic feature space
with a lot of possible features for
the machine learning program to work on,
and machine learning can be applied to select
the most effective features or
construct new features.
That's feature learning, and
these features can then be further
analyzed by humans through error analysis.
And you can look at
the categorization errors, and
then further analyze what features can
help you recover from those errors,
or what features cause overfitting and
cause those errors.
And so this can lead to
feature validation that will
revise the feature set,
and then you can iterate.
And we might consider using
a different feature space.
So NLP enriches text
representation, as I just said, and
because it enriches the feature space,
it allows a much larger space
of features, and there are also many,
many more features that can be
very useful for a lot of tasks.
But be careful not to use a lot
of complicated features, because
that can cause overfitting;
otherwise you would
have to train carefully
so as not to let overfitting happen.
So a main challenge in designing features,
a common challenge, is to optimize
a trade-off between exhaustivity and
specificity, and this trade-off
turns out to be very difficult.
Now exhaustivity means we want
the features to actually have
high coverage of a lot of documents.
And so in that sense,
you want the features to be frequent.
Specificity requires the feature
to be discriminative, so
naturally, infrequent features
tend to be more discriminative.
So this really causes a trade-off between
frequent versus infrequent features.
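One rough way to look at this trade-off on a labeled sample is sketched below. Coverage is taken as the fraction of documents that contain a feature, and purity as the share of those documents that fall in the feature's most common class. Both measures, the function name `coverage_and_purity`, and the toy data are assumptions made for this illustration, not definitions from the lecture.

```python
from collections import Counter, defaultdict


def coverage_and_purity(labeled_docs):
    """For each word feature, return (coverage, purity):
    coverage = fraction of documents containing the feature,
    purity   = share of those documents in the feature's majority class."""
    n_docs = len(labeled_docs)
    by_feature = defaultdict(Counter)
    for text, label in labeled_docs:
        for feat in set(text.lower().split()):
            by_feature[feat][label] += 1
    stats = {}
    for feat, label_counts in by_feature.items():
        df = sum(label_counts.values())            # document frequency
        stats[feat] = (df / n_docs, max(label_counts.values()) / df)
    return stats


data = [
    ("the plot is great", "pos"),
    ("the acting is great", "pos"),
    ("the plot is dull", "neg"),
    ("the ending is dreadful", "neg"),
]
stats = coverage_and_purity(data)
```

In this toy sample, "the" covers every document but says nothing about the class, while "dreadful" is perfectly pure but covers only one document. On such a tiny sample, a purity of 1.0 for a rare word is also exactly the kind of evidence that invites overfitting.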
And that's why feature
design is usually hard.
And that's probably the most important
part in applying machine learning to any
problem, and in particular in our case,
for text categorization or,
more specifically,
sentiment classification.

